Estimating Collection Size in Distributed Search
Authors
Abstract
Distributed search is an effective way to search the thousands of information collections available on the web. Collection size is an important feature in distributed search and plays a vital role in resource representation and selection. This paper proposes two novel algorithms for estimating collection size in uncooperative environments. The sample high frequent resample (SHFRS) algorithm first samples a collection with random queries and then resamples it with the most frequent queries found in the sample sets. The heterogeneous capture (HC) algorithm accounts for capture probabilities that differ across documents and estimates collection size by conditional maximum likelihood. Both algorithms are evaluated on real web data. Experimental results show that they significantly outperform both the sample-resample and the capture-recapture algorithms.
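For context, the two baselines named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's SHFRS or HC algorithms: the function names and inputs are hypothetical, the capture-recapture estimator shown is the classic two-sample Lincoln-Petersen form, and the sample-resample estimator scales a probe term's reported document frequency by its frequency in the sample.

```python
def capture_recapture_size(sample_a, sample_b):
    """Lincoln-Petersen estimate from two samples of document IDs.

    Assumes both samples are drawn independently and every document
    is equally likely to be captured (the assumption HC relaxes).
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples share no documents; estimate is undefined")
    return len(a) * len(b) / overlap


def sample_resample_size(sample_size, df_in_sample, df_in_collection):
    """Sample-resample estimate from one probe term.

    df_in_sample: number of sampled documents containing the term.
    df_in_collection: hit count the collection reports for the term.
    Assumes the term is as frequent in the sample as in the collection.
    """
    if df_in_sample == 0:
        raise ValueError("probe term does not occur in the sample")
    return df_in_collection * sample_size / df_in_sample


if __name__ == "__main__":
    # Two samples of 100 IDs each that overlap on 50 documents.
    print(capture_recapture_size(range(0, 100), range(50, 150)))
    # A term in 30 of 300 sampled documents, with 1200 reported hits.
    print(sample_resample_size(300, 30, 1200))
```

Both estimators rest on uniformity assumptions that rarely hold for query-based sampling, which is the gap the abstract says the proposed algorithms address.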
Similar resources
How Much Data Resides in a Web Collection: How to Estimate Size of a Web Collection
With increasing amount of data in deep web sources (hidden from general search engines behind web forms), accessing this data has gained more attention. In the algorithms applied for this purpose, it is the knowledge of a data source size that enables the algorithms to make accurate decisions in stopping crawling or sampling processes which can be so costly in some cases [4]. The tendency to kn...
Distributed Generation Effects on Unbalanced Distribution Network Losses Considering Cost and Security Indices
Due to the increasing interest on renewable sources in recent years, the studies on integration of distributed generation to the power grid have rapidly increased. In order to minimize line losses of power systems, it is crucially important to define the size and location of local generation to be placed. Minimizing the losses in the system would bring two types of saving, in real life, one is ...
Estimating the size of search trees by sampling with domain knowledge
We show how recently-defined abstract models of the Branch & Bound algorithm can be used to obtain information on how the nodes are distributed in B&B search trees. This can be directly exploited in the form of probabilities in a sampling algorithm given by Knuth that estimates the size of a search tree. This method reduces the offline estimation error by a factor of two on search trees from Mi...
Estimating Size of Search Engines in an Uncooperative Environment
The number of documents that are indexed by a search engine is referred to as the size of the search engine. The information about the size of each underlying search engine is essential for any metasearch engine to conduct search engine selection, result merging and a few other processes. Thus, effectively estimating the size of search engines is important for a metasearch engine that incorpora...
Publication date: 2007